How is justice related to race? Does police officer behaviour lead to injuries? Is racism a major factor in crime incidents? These questions, along with many others, need to be answered from the dataset provided. It contains details about the officer, the subject, the injuries caused, the incident location, and the date and time of each incident. We will examine the main variables in the dataset to build a concise analysis that answers the questions above.
We can get an overview of our dataset with the help of the
datatable and dim functions. This first look shows the dimensions
of the data and the types of its variables, which helps us decide
which variables to keep and informs the EDA that follows.
df <- read.csv("~/Documents/R_data_Visualizations/37-00049_UOF-P_2016_prepped.csv",na.strings = c(""))
datatable(df)
dim(df)
## [1] 2384 47
This stage involves several steps, from removing duplicates in the data to checking its normality. As a first step we remove the first row of the data, since it simply repeats the variable names.
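A minimal sketch of that first step (assuming the duplicated header occupies row 1 of the imported data frame):

```r
# Drop the first row, which repeats the column names
df <- df[-1, ]
```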
We observe that there are character variables which can be
converted to factors, and others which can be converted to numeric
for plotting, so we convert them to the desired data types with
base R (as.factor and as.numeric). After conversion the result is:
df <- lapply(df, as.factor) %>% data.frame()
df$OFFICER_YEARS_ON_FORCE <- as.numeric(as.character(df$OFFICER_YEARS_ON_FORCE))
df$STREET_NUMBER <- as.numeric(as.character(df$STREET_NUMBER))
df$SECTOR <- as.numeric(as.character(df$SECTOR))
df$LOCATION_LATITUDE <- as.numeric(as.character(df$LOCATION_LATITUDE))
df$LOCATION_LONGITUDE <- as.numeric(as.character(df$LOCATION_LONGITUDE))
datatable(diagnose(df))
From the table above we notice several columns with many missing values; we can remove them with the code below.
df <- df %>% select(!matches("USED")) %>%
select(-c(LOCATION_CITY, LOCATION_STATE, NUMBER_EC_CYCLES, OFFICER_ID, SUBJECT_ID, BEAT, UOF_NUMBER))
Before the data visualisation and normality check, we note that some
variables are stored in date format, so we derive new columns holding
the separate month, day of week, date and hour of each incident. This
will help us analyse the data further in the data visualisation
section.
# Dates are stored as "m/d/yy"; parsing them with %Y yields years like
# 0016, so patch the century before re-parsing.
df$INCIDENT_DATE <- as.Date(df$INCIDENT_DATE, format = "%m/%d/%Y")
df$INCIDENT_DATE <- sub("^00", "20", df$INCIDENT_DATE)
df$INCIDENT_DATE <- as.Date(df$INCIDENT_DATE, format = "%Y-%m-%d")
df$INCIDENT_TIME <- format(strptime(df$INCIDENT_TIME, "%I:%M:%S %p"), "%H:%M:%S")
df$INCIDENT_MONTH <- months(as.Date(df$INCIDENT_DATE))
df$INC_MONTH <-format(df$INCIDENT_DATE,"%m")
df$INCIDENT_HOUR <- as.numeric(substr(df$INCIDENT_TIME, 1, 2))
df$INCIDENT_DAY <- wday(df$INCIDENT_DATE)
df$INC_HOUR <- substr(df$INCIDENT_TIME, 1, 2)
df$INC_DATE <- substr(df$INCIDENT_DATE, 9, 10)
## Create grouped data frames:
df_year <- df %>%
group_by(INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_DAY) %>%
summarize(count = n())
df_month <- df %>%
group_by(INC_MONTH) %>%
summarize(count = n())
df_day <- df %>%
group_by(INCIDENT_DAY,INCIDENT_HOUR) %>%
summarize(count = n())
df %>% group_by(INC_HOUR) %>%
summarize(avg =n()) -> df_hour_n
For the EDA we have used the R packages dlookr and
DataExplorer. The following table summarises the
central tendency of the numeric variables in our dataset.
diagnose_numeric(df)
## # A tibble: 7 × 10
## variables min Q1 mean median Q3 max zero minus outlier
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 OFFICER_YEARS_ON_FO… 0 3 8.05 6 10 36 3 0 240
## 2 SECTOR 110 210 389. 350 610 750 0 0 0
## 3 STREET_NUMBER 0 3 8.05 6 10 36 3 0 240
## 4 LOCATION_LATITUDE 32.6 32.7 32.8 32.8 32.9 33.0 0 0 0
## 5 LOCATION_LONGITUDE -97.0 -96.8 -96.8 -96.8 -96.8 -96.6 0 2328 44
## 6 INCIDENT_HOUR 0 5 13.0 16 20 23 142 0 0
## 7 INCIDENT_DAY 1 2 4.05 4 6 7 0 0 0
We can inspect the outliers in the dataset with boxplots of the numeric variables.
df %>%
plot_outlier(diagnose_outlier(df) %>%
filter(outliers_ratio >= 0.5) %>%
select(variables) %>%
unlist())
<img src="Policing_assignment_files/figure-html/unnamed-chunk-13-1.png" width="960" style="display: block; margin: auto;" /><img src="Policing_assignment_files/figure-html/unnamed-chunk-13-2.png" width="960" style="display: block; margin: auto;" /><img src="Policing_assignment_files/figure-html/unnamed-chunk-13-3.png" width="960" style="display: block; margin: auto;" />
The pairwise correlations can also be extracted in tabular form:
correlate(df)
## # A tibble: 42 × 3
## var1 var2 coef_corr
## <fct> <fct> <dbl>
## 1 SECTOR OFFICER_YEARS_ON_FORCE 0.0182
## 2 STREET_NUMBER OFFICER_YEARS_ON_FORCE 1
## 3 LOCATION_LATITUDE OFFICER_YEARS_ON_FORCE 0.0839
## 4 LOCATION_LONGITUDE OFFICER_YEARS_ON_FORCE -0.0821
## 5 INCIDENT_HOUR OFFICER_YEARS_ON_FORCE -0.0329
## 6 INCIDENT_DAY OFFICER_YEARS_ON_FORCE 0.0286
## 7 OFFICER_YEARS_ON_FORCE SECTOR 0.0182
## 8 STREET_NUMBER SECTOR 0.0182
## 9 LOCATION_LATITUDE SECTOR 0.00273
## 10 LOCATION_LONGITUDE SECTOR -0.349
## # … with 32 more rows
df %>%
correlate() %>%
plot()
<img src="Policing_assignment_files/figure-html/unnamed-chunk-15-1.png" width="960" style="display: block; margin: auto;" />
normality(df)
## # A tibble: 7 × 4
## vars statistic p_value sample
## <chr> <dbl> <dbl> <dbl>
## 1 OFFICER_YEARS_ON_FORCE 0.811 1.91e-46 2383
## 2 SECTOR 0.912 3.56e-35 2383
## 3 STREET_NUMBER 0.811 1.91e-46 2383
## 4 LOCATION_LATITUDE 0.955 4.33e-26 2383
## 5 LOCATION_LONGITUDE 0.984 1.35e-15 2383
## 6 INCIDENT_HOUR 0.890 3.38e-38 2383
## 7 INCIDENT_DAY 0.900 6.97e-37 2383
The Shapiro-Wilk p-values for all the numeric variables are below 0.05, so at the 5% significance level we reject the null hypothesis of normality: none of these variables is normally distributed, even after the steps given above.
We have not yet examined skewness in our dataset; the code snippet below identifies the skewed variables.
find_skewness(df)
## [1] 6 21 26
find_skewness(df, index = FALSE)
## [1] "OFFICER_YEARS_ON_FORCE" "STREET_NUMBER" "LOCATION_LATITUDE"
find_skewness(df, value = TRUE)
## OFFICER_YEARS_ON_FORCE SECTOR STREET_NUMBER
## 1.484 0.268 1.484
## LOCATION_LATITUDE LOCATION_LONGITUDE INCIDENT_HOUR
## 0.594 0.287 -0.462
So the key numeric variable, years on the force, is strongly skewed and needs a transformation. The other two flagged variables are only mildly skewed and carry little weight in the analysis, so we leave them untransformed.
We can reduce the skewness with a log transformation. Note that OFFICER_YEARS_ON_FORCE contains zeros, which a plain log maps to -Inf (visible in the summary below); dlookr's "log+1" method would avoid this.
OFFICER_YEARS_ON_FORCE = transform(df$OFFICER_YEARS_ON_FORCE, method = "log")
summary(OFFICER_YEARS_ON_FORCE)
## * Resolving Skewness with log
##
## * Information of Transformation (before vs after)
## Original Transformation
## n 2383.0000000 2383.0000000
## na 0.0000000 0.0000000
## mean 8.0490978 -Inf
## sd 7.5624813 NaN
## se_mean 0.1549181 NaN
## IQR 7.0000000 1.2039728
## skewness 1.4852096 NaN
## kurtosis 1.4770790 NaN
## p00 0.0000000 -Inf
## p01 1.0000000 0.0000000
## p05 1.0000000 0.0000000
## p10 1.0000000 0.0000000
## p20 2.0000000 0.6931472
## p25 3.0000000 1.0986123
## p30 3.0000000 1.0986123
## p40 4.0000000 1.3862944
## p50 6.0000000 1.7917595
## p60 7.0000000 1.9459101
## p70 9.0000000 2.1972246
## p75 10.0000000 2.3025851
## p80 11.0000000 2.3978953
## p90 21.0000000 3.0445224
## p95 26.0000000 3.2580965
## p99 31.0000000 3.4339872
## p100 36.0000000 3.5835189
Plotting the variable after the skewness transformation:
plot(OFFICER_YEARS_ON_FORCE)
<img src="Policing_assignment_files/figure-html/unnamed-chunk-19-1.png" width="960" style="display: block; margin: auto;" />
Along with the individual steps described above, we can generate an automatic EDA report from the data.
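Both packages used in this analysis offer a one-call report; a sketch (exact arguments depend on the installed package versions):

```r
library(DataExplorer)
library(dlookr)

# DataExplorer: HTML report covering structure, missingness,
# distributions and correlations
create_report(df, output_file = "eda_report.html")

# dlookr: equivalent automated EDA report
eda_web_report(df, output_file = "dlookr_eda.html")
```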
Data visualisation gives great insight into the data. The following graphs mainly relate subject characteristics to the other parameters; a short description precedes each graph.
The figure below shows that incidents drop sharply in the morning hours.
p <- df_day %>%
filter(!is.na(INCIDENT_HOUR)) %>%
ggplot() +
aes(x = INCIDENT_HOUR, y = count, size = count) +
geom_point(shape = "circle open",
colour = "#B22222") +
geom_smooth(span = 0.75) +
ggthemes::theme_base()
ggplotly(p)
Incident frequency at different hours
The following violin plots show that incidents are greatly reduced in February compared to other months. Moreover, the variation around the mean is most pronounced in the winter months.
ggplot(df_year) +
aes(x = count, y = INCIDENT_MONTH, fill = INCIDENT_MONTH) +
geom_violin(adjust = 1L,
scale = "area") +
scale_fill_hue(direction = 1) +
theme_minimal()
<img src=“Policing_assignment_files/figure-html/unnamed-chunk-22-1.png” width=“960” 20% style=“display: block; margin: auto;” />
The following graph shows that Black males are the group most frequently involved in incidents.
p <- df %>%
filter(!(SUBJECT_RACE %in% "NULL")) %>%
filter(SUBJECT_GENDER %in% c("Female", "Male")) %>%
filter(!(INCIDENT_REASON %in%
"NULL")) %>%
filter(!(REASON_FOR_FORCE %in% "NULL")) %>%
filter(!is.na(INCIDENT_HOUR)) %>%
filter(!is.na(INC_HOUR)) %>%
ggplot() +
aes(x = SUBJECT_RACE, fill = SUBJECT_RACE) +
geom_bar() +
scale_fill_hue(direction = 1) +
ggthemes::theme_base() +
theme(legend.position = "bottom") +
facet_wrap(vars(SUBJECT_GENDER), ncol = 2L)+theme(axis.text.x = element_text(angle=45, vjust=2, hjust=1))
ggplotly(p)
Whenever there is a chance of arrest, the police officer is most likely to use force, as shown in the graph below.
p <- df %>%
filter(!(SUBJECT_RACE %in% "NULL")) %>%
filter(SUBJECT_GENDER %in% c("Female", "Male")) %>%
filter(!(INCIDENT_REASON %in%
"NULL")) %>%
filter(!(REASON_FOR_FORCE %in% "NULL")) %>%
filter(!is.na(INCIDENT_HOUR)) %>%
filter(!is.na(INC_HOUR)) %>%
ggplot() +
aes(x = REASON_FOR_FORCE, fill = SUBJECT_GENDER, weight = OFFICER_YEARS_ON_FORCE) +
geom_bar() +
scale_fill_hue(direction = 1) +
ggthemes::theme_few() +
theme(legend.position = "bottom") +
facet_wrap(vars(OFFICER_INJURY),
scales = "free", ncol = 1L)+theme(axis.text.x = element_text(angle=45, vjust=2, hjust=1))
ggplotly(p)
Both young and experienced officers sustain injuries during incidents.
p <- df %>%
filter(!(SUBJECT_RACE %in% "NULL")) %>%
filter(SUBJECT_GENDER %in% c("Female", "Male")) %>%
filter(!(INCIDENT_REASON %in%
"NULL")) %>%
filter(!(REASON_FOR_FORCE %in% "NULL")) %>%
filter(!is.na(INCIDENT_HOUR)) %>%
filter(!is.na(INC_HOUR)) %>%
ggplot() +
aes(x = OFFICER_YEARS_ON_FORCE, fill = OFFICER_INJURY, weight = OFFICER_YEARS_ON_FORCE) +
geom_density(adjust = 1L) +
scale_fill_hue(direction = 1) +
ggthemes::theme_base() +
theme(legend.position = "bottom")
ggplotly(p)
The graph below shows no clear correlation between officer race and subject race during incidents; the pattern is roughly uniform across combinations, although males are far more often involved in incidents than females.
p <- df %>%
filter(!(SUBJECT_RACE %in% "NULL")) %>%
filter(SUBJECT_GENDER %in% c("Female", "Male")) %>%
filter(!(INCIDENT_REASON %in%
"NULL")) %>%
filter(!(REASON_FOR_FORCE %in% "NULL")) %>%
filter(!is.na(INCIDENT_HOUR)) %>%
filter(!is.na(INC_HOUR)) %>%
ggplot() +
aes(x = SUBJECT_RACE, y = OFFICER_RACE, fill = OFFICER_GENDER) +
geom_tile(size = 0.5) +
scale_fill_hue(direction = 1) +
ggthemes::theme_base() +
theme(legend.position = "bottom")+theme(axis.text.x = element_text(angle=45, vjust=2, hjust=1))
ggplotly(p)
From the graph below we can check whether an injury to the officer is related to the subject's arrest. We hypothesised that an officer injured by the subject during an incident would be more likely to make an arrest, but this hypothesis appears clearly rejected.
p <- ggplot(df) +
aes(x = SUBJECT_WAS_ARRESTED) +
geom_bar(fill = "#112446") +
labs(title = "Officer injury relation with the Subject arrest") +
ggthemes::theme_base() +
facet_wrap(vars(OFFICER_INJURY))
ggplotly(p)
Data analysis has been conducted on the Dallas, USA policing equity dataset. Several EDA steps were carried out to clean the data, and the results were checked with normality and skewness tests. The visualisations then show that Black males are the group most frequently involved in incidents, compared to Hispanic and White males, with Asians ranking second. Officer gender appears to have no effect on arrests, and a similar pattern holds for officer race in relation to subject race.